tf-idf, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect 
how important a word is to a document in a collection or corpus.[1]:8 It is often used as a weighting factor in 
information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears 
in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some 
words are generally more common than others. 
Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a 
document's relevance given a user query. tf-idf can be successfully used for stop-words filtering in various subject 
fields including text summarization and classification. 
One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more 
sophisticated ranking functions are variants of this simple model. 
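
The summing approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_by_tfidf_sum` is hypothetical, tokenization is a plain lowercase whitespace split, term frequency is the raw count, and idf is the logarithm of the number of documents divided by the number of documents containing the term.

```python
import math
from collections import Counter


def rank_by_tfidf_sum(query, docs):
    """Order document indices by the summed tf-idf of the query terms.

    Hypothetical helper: raw counts for tf, log(N / df) for idf,
    and naive whitespace tokenization (all assumptions of this sketch).
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(docs)

    def idf(term):
        df = sum(1 for c in counts if term in c)
        # A term that occurs nowhere in the corpus contributes nothing.
        return math.log(n_docs / df) if df else 0.0

    query_terms = query.lower().split()
    scores = [sum(c[t] * idf(t) for t in query_terms) for c in counts]
    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return order, scores


docs = [
    "the brown cow eats the grass",
    "the quick brown fox",
    "the sky is blue",
]
order, scores = rank_by_tfidf_sum("the brown cow", docs)
print(order)   # document indices, best match first
print(scores)  # one summed tf-idf score per document
```

Because "the" occurs in every document of this toy corpus, its idf is zero and it adds nothing to any score; the ranking is driven by "brown" and "cow".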
Motivation


Suppose we have a set of English text documents and wish to determine which document is most relevant to the 
query "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", 
"brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of 
times each term occurs in each document and sum them all together; the number of times a term occurs in a 
document is called its term frequency. 


However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to 
use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". 
The term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less 
common words "brown" and "cow". Hence an inverse document frequency factor is incorporated which diminishes the 
weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely. 


Definition


tf-idf is the product of two statistics, term frequency and inverse document frequency. Various ways for determining 
the exact values of both statistics exist. In the case of the term frequency tf(t,d), the simplest choice is to use the raw 
frequency of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw 
frequency of t by f(t,d), then the simple tf scheme is tf(t,d) = f(t,d). Other possibilities include:[2]:118 


• Boolean "frequencies": tf(t,d) = 1 if t occurs in d and 0 otherwise; 

• logarithmically scaled frequency: tf(t,d) = log (f(t,d) + 1); 

• augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the maximum 
raw frequency of any term in the document: 

$$\mathrm{tf}(t,d) = 0.5 + 0.5 \times \frac{f(t,d)}{\max\{f(w,d) : w \in d\}}$$

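
The term frequency variants listed above can be compared side by side with a small Python sketch. The function names are hypothetical, a document is assumed to be a non-empty list of tokens, and the natural logarithm is used for the logarithmic variant.

```python
import math
from collections import Counter


def tf_raw(term, tokens):
    """Raw frequency f(t, d): how often the term occurs in the token list."""
    return Counter(tokens)[term]


def tf_boolean(term, tokens):
    """1 if the term occurs in the document, 0 otherwise."""
    return 1 if term in tokens else 0


def tf_log(term, tokens):
    """Logarithmically scaled frequency: log(f(t, d) + 1)."""
    return math.log(tf_raw(term, tokens) + 1)


def tf_augmented(term, tokens):
    """0.5 + 0.5 * f(t, d) / max f(w, d); assumes tokens is non-empty."""
    counts = Counter(tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())


tokens = "this is another another example example example".split()
for name in ("example", "another", "missing"):
    print(name, tf_raw(name, tokens), tf_boolean(name, tokens),
          tf_log(name, tokens), tf_augmented(name, tokens))
```

Note that the augmented variant as written assigns 0.5 even to a term that does not occur in the document, so callers may want to combine it with a membership check.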
The inverse document frequency is a measure of how much information the word provides, that is, whether the 
term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that 
contain the word, obtained by dividing the total number of documents by the number of documents containing the 
term, and then taking the logarithm of that quotient. 

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

with 

• $N$: total number of documents in the corpus 

• $|\{d \in D : t \in d\}|$: number of documents where the term $t$ appears (i.e., $\mathrm{tf}(t, d) \neq 0$). If the term is not in 
the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to 
$1 + |\{d \in D : t \in d\}|$. 


Mathematically the base of the log function does not matter and constitutes a constant multiplicative factor towards 
the overall result. 
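
A minimal Python sketch of the inverse document frequency, including the adjusted denominator and the effect of the logarithm base, might look as follows. The function name `idf` and its `smooth` and `base` parameters are assumptions made for this illustration, and each document is represented as a set of terms.

```python
import math


def idf(term, docs, smooth=False, base=math.e):
    """log(N / df), or log(N / (1 + df)) when smooth=True.

    docs is a list of term sets. Without smoothing, a term missing from
    every document raises ZeroDivisionError, as described in the text.
    """
    n_docs = len(docs)
    df = sum(1 for terms in docs if term in terms)
    denominator = 1 + df if smooth else df
    return math.log(n_docs / denominator, base)


docs = [
    {"this", "is", "a", "sample"},
    {"this", "is", "another", "example"},
]

natural = idf("example", docs)           # natural logarithm
decimal = idf("example", docs, base=10)  # base-10 logarithm
# Changing the base only rescales the value by a constant factor.
print(natural, decimal, natural / decimal)

# With the adjusted denominator, a term found in every document
# gets a slightly negative value instead of zero.
print(idf("this", docs, smooth=True))
```

The ratio printed on the first line is the same for every term (it equals the natural logarithm of 10), which is the constant multiplicative factor mentioned above.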


Then tf-idf is calculated as 

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$


A high weight in tf-idf is reached by a high term frequency (in the given document) and a low document frequency of 
the term in the whole collection of documents; the weights hence tend to filter out common terms. With the unadjusted 
denominator, the ratio inside the idf's log function is always greater than or equal to 1, so the value of idf (and tf-idf) 
is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, 
bringing the idf and tf-idf closer to 0. 


Justification of idf


idf was introduced, as "term specificity", by Karen Spärck Jones in a 1972 paper. Although it has worked well as a 
heuristic, its theoretical foundations have been troublesome for at least three decades afterward, with many 
researchers trying to find information-theoretic justifications for it.[9] 


Spärck Jones's own explanation didn't propose much theory, aside from a connection to Zipf's law. Attempts have 
been made to put idf on a probabilistic footing, by estimating the probability that a given document d contains a 
term t as 

$$P(t|d) = \frac{|\{d \in D : t \in d\}|}{N},$$

so that we can define idf as 

$$\begin{aligned}
\mathrm{idf} &= -\log P(t|d) \\
&= \log \frac{1}{P(t|d)} \\
&= \log \frac{N}{|\{d \in D : t \in d\}|}
\end{aligned}$$

This probabilistic interpretation in turn takes the same form as that of self-information. However, applying such 
information-theoretic notions to problems in information retrieval leads to problems when trying to define the 
appropriate event spaces for the required probability distributions: not only documents need to be taken into account, 
but also queries and terms.[3] 


Example of tf-idf


Suppose we have term frequency tables for a collection consisting of only two documents, as listed below, then 
calculation of tf-idf for the term "this" in document 1 is performed as follows. 

Document 1 
Term      Term Count 
this      1 
is        1 
a         2 
sample    1 

Document 2 
Term      Term Count 
this      1 
is        1 
another   2 
example   3 


Tf, in its basic form, is just the frequency that we look up in the appropriate table. In this case, it's one. 


Idf is a bit more involved: 

$$\mathrm{idf}(\mathrm{this}, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

The numerator of the fraction is the number of documents, which is two. The number of documents in which "this" 
appears is also two, giving 

$$\mathrm{idf}(\mathrm{this}, D) = \log \frac{2}{2} = 0$$


So tf-idf is zero for this term, and with the basic definition this is true of any term that occurs in all documents. 


A slightly more interesting example arises from the word "example", which occurs three times but in only one 
document. For this document, tf-idf of "example" is: 

$$\mathrm{tf}(\mathrm{example}, d_2) = 3$$

$$\mathrm{idf}(\mathrm{example}, D) = \log \frac{2}{1} \approx 0.3010$$

$$\mathrm{tfidf}(\mathrm{example}, d_2) = \mathrm{tf}(\mathrm{example}, d_2) \times \mathrm{idf}(\mathrm{example}, D) = 3 \log 2 \approx 0.9031$$


(using the base 10 logarithm). 
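
The worked example can be checked with a short Python sketch. The dictionaries mirror the two term count tables; the function names `tf`, `idf` and `tfidf` are chosen for this illustration, with raw counts for tf and the base-10 logarithm for idf, as in the example.

```python
import math

# Term counts copied from the two example documents.
doc1 = {"this": 1, "is": 1, "a": 2, "sample": 1}
doc2 = {"this": 1, "is": 1, "another": 2, "example": 3}
corpus = [doc1, doc2]


def tf(term, doc):
    """Raw term frequency looked up in the document's count table."""
    return doc.get(term, 0)


def idf(term, corpus):
    """Base-10 log of (number of documents / documents containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df)


def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


print(tfidf("this", doc1, corpus))                # 0.0
print(round(idf("example", corpus), 4))           # 0.301
print(round(tfidf("example", doc2, corpus), 4))   # 0.9031
```

Computed at full precision, the product for "example" rounds to 0.9031; multiplying the already rounded idf of 0.3010 by 3 gives 0.9030 instead.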


External links and suggested reading


Gensim is a Python library for vector space modeling and includes tf-idf weighting. 

Robust Hyperlinking: An application of tf-idf for stable document addressability. 

A demo of using tf-idf with PHP and Euclidean distance for classification 

Anatomy of a search engine 

tf-idf and related definitions as used in Lucene 

TfidfTransformer in scikit-learn 

Text to Matrix Generator (TMG): MATLAB toolbox that can be used for various tasks in text mining (TM), 
specifically i) indexing, ii) retrieval, iii) dimensionality reduction, iv) clustering, v) classification. The indexing step 
offers the user the ability to apply local and global weighting methods, including tf-idf. 

Pyevolve: A tutorial series explaining the tf-idf calculation 

TF/IDF with Google n-Grams and POS Tags 